This tutorial demonstrates how to use R to perform the data analytics tasks covered in Week 02: variable types and measurement scales, descriptive statistics, data visualization, regression modeling, and hypothesis testing.
# Install packages if needed (uncomment to install)
# install.packages(c("tidyverse", "ggplot2", "dplyr", "knitr", "kableExtra"))
# install.packages("seedhash") # For reproducible seed generation
# Load required packages
library(tidyverse) # Data manipulation and visualization
library(ggplot2) # Advanced plotting
library(dplyr) # Data manipulation
library(knitr) # Table formatting
library(kableExtra) # Enhanced tables
library(seedhash) # Reproducible seed generation
# Display seed information
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")##
## === REPRODUCIBLE SEED INFORMATION ===
##
## Generator Name: Ziyuan Huang
##
## MD5 Hash: 0e112cb70d324ea5e5b9f803b4488bde
##
## Available Seeds: -805293604, -909845485, -339394968, 586956699, -960637137 ...
R handles different types of variables. Let’s explore them:
# Numeric (continuous) variables
age <- 25
height <- 175.5
temperature <- 98.6
# Integer variables
num_students <- 30L
year <- 2025L
# Character (string) variables
name <- "John Doe"
city <- "Washington DC"
# Logical (boolean) variables
is_student <- TRUE
passed_exam <- FALSE
# Factor (categorical) variables
education_level <- factor(c("High School", "Bachelor", "Master", "PhD"))
grade <- factor(c("A", "B", "C", "D", "F"), ordered = TRUE)
# Check variable types
cat("Type of age:", class(age), "\n")## Type of age: numeric
## Type of name: character
## Type of is_student: logical
## Type of education_level: factor
# Binary variable example: Pass/Fail
pass_fail <- c("Pass", "Fail", "Pass", "Pass", "Fail")
pass_fail_factor <- factor(pass_fail)
cat("Binary variable levels:", levels(pass_fail_factor), "\n")## Binary variable levels: Fail Pass
## pass_fail_factor
## Fail Pass
## 2 3
# Nominal: Categories without order
departments <- factor(c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"))
cat("Department frequency:\n")## Department frequency:
## departments
## Finance HR IT Marketing
## 2 2 2 1
# Visualize
barplot(table(departments),
main = "Department Distribution",
col = "steelblue",
ylab = "Frequency",
        las = 1)

# Ordinal: Categories with logical order
satisfaction <- factor(
c("Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied",
"Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Unsatisfied"),
levels = c("Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"),
ordered = TRUE
)
cat("Satisfaction levels:\n")## Satisfaction levels:
## satisfaction
## Very Unsatisfied Unsatisfied Neutral Satisfied
## 1 2 2 3
## Very Satisfied
## 2
# Visualize with ordered categories
barplot(table(satisfaction),
main = "Customer Satisfaction Survey",
col = rainbow(5),
ylab = "Count",
las = 2,
        cex.names = 0.8)

# Interval: Equal intervals but no true zero (e.g., temperature in Celsius)
temperatures <- c(15, 18, 22, 25, 20, 16, 19, 23, 21, 17)
cat("Temperature Statistics:\n")## Temperature Statistics:
## Mean: 19.6 °C
## Median: 19.5 °C
## Range: 15 to 25 °C
# Note: 0°C doesn't mean "no temperature"
hist(temperatures,
main = "Temperature Distribution",
xlab = "Temperature (°C)",
col = "lightblue",
border = "white")# Ratio: Equal intervals AND true zero (e.g., height, weight, income)
heights <- c(160, 175, 182, 168, 155, 190, 172, 165, 178, 185)
cat("Height Statistics:\n")## Height Statistics:
## Mean: 173 cm
## Median: 173.5 cm
## Note: 0 cm would mean no height (true zero)
# Ratio calculations make sense
cat("\nRatio example: 180cm is", 180/90, "times taller than 90cm\n")##
## Ratio example: 180cm is 2 times taller than 90cm
Let’s work with built-in R datasets and real examples:
# Load built-in datasets
data(iris) # Famous flower dataset
data(mtcars) # Motor cars data
data(airquality) # Air quality measurements
# Preview datasets
cat("Iris dataset (first 6 rows):\n")## Iris dataset (first 6 rows):
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
# Calculate mean
sepal_lengths <- iris$Sepal.Length
mean_value <- mean(sepal_lengths)
cat("Mean Sepal Length:", round(mean_value, 2), "cm\n")## Mean Sepal Length: 5.84 cm
# Visualize with mean line
hist(sepal_lengths,
main = "Sepal Length Distribution with Mean",
xlab = "Sepal Length (cm)",
col = "lightgreen",
border = "white")
abline(v = mean_value, col = "red", lwd = 3, lty = 2)
legend("topright", legend = paste("Mean =", round(mean_value, 2)),
col = "red", lty = 2, lwd = 3)Understanding the Mean:
\[\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}\]
# Manual calculation to understand the formula
sum_values <- sum(sepal_lengths)
n_values <- length(sepal_lengths)
manual_mean <- sum_values / n_values
cat("Sum of values:", sum_values, "\n")## Sum of values: 876.5
## Number of values: 150
## Manual mean calculation: 5.84
## R's mean() function: 5.84
# Calculate median
median_value <- median(sepal_lengths)
cat("Median Sepal Length:", round(median_value, 2), "cm\n")## Median Sepal Length: 5.8 cm
## Mean Sepal Length: 5.84 cm
## Difference: 0.043 cm
# Visualize both
hist(sepal_lengths,
main = "Sepal Length: Mean vs Median",
xlab = "Sepal Length (cm)",
col = "skyblue",
border = "white")
abline(v = mean_value, col = "red", lwd = 3, lty = 2)
abline(v = median_value, col = "blue", lwd = 3, lty = 2)
legend("topright",
legend = c(paste("Mean =", round(mean_value, 2)),
paste("Median =", round(median_value, 2))),
col = c("red", "blue"),
lty = 2,
       lwd = 3)

Finding the Median:
- Odd number of values: the middle value
- Even number of values: the average of the two middle values
# Example with manual calculation
facebook_friends <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
sorted_friends <- sort(facebook_friends)
cat("Sorted values:", sorted_friends, "\n")## Sorted values: 22 40 53 57 93 98 103 108 116 121 252
## Position of median: (n+1)/2 = (11+1)/2 = 6
## Median value: 98
# Remove extreme value
friends_no_extreme <- facebook_friends[facebook_friends != 252]
cat("\nWithout extreme value (252):\n")##
## Without extreme value (252):
## New median: 95.5
## Change: 2.5
# R doesn't have a built-in mode function, so we create one
get_mode <- function(x) {
unique_vals <- unique(x)
freq_table <- tabulate(match(x, unique_vals))
unique_vals[which.max(freq_table)]
}
# Example with discrete data
grades <- c("A", "B", "B", "C", "B", "A", "C", "B", "D", "B", "C")
mode_grade <- get_mode(grades)
cat("Mode (most frequent grade):", mode_grade, "\n")## Mode (most frequent grade): B
##
## Frequency table:
## grades
## A B C D
## 2 5 3 1
# Visualize
barplot(table(grades),
main = "Grade Distribution (Mode in Red)",
col = ifelse(names(table(grades)) == mode_grade, "red", "steelblue"),
ylab = "Frequency")# Create a comparison table
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
Mean = mean(Sepal.Length),
Median = median(Sepal.Length),
Min = min(Sepal.Length),
Max = max(Sepal.Length),
N = n()
) %>%
mutate(across(where(is.numeric), ~round(., 2)))
iris_summary %>%
kable(caption = "Central Tendency by Species") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| Species | Mean | Median | Min | Max | N |
|---|---|---|---|---|---|
| setosa | 5.01 | 5.0 | 4.3 | 5.8 | 50 |
| versicolor | 5.94 | 5.9 | 4.9 | 7.0 | 50 |
| virginica | 6.59 | 6.5 | 4.9 | 7.9 | 50 |
# Calculate range
range_vals <- range(sepal_lengths)
range_value <- diff(range_vals)
cat("Minimum:", range_vals[1], "\n")## Minimum: 4.3
## Maximum: 7.9
## Range: 3.6
##
## Effect of extreme values:
## Original range: 230
## Without extreme (252): 99
## Reduction: 131
# Calculate quartiles
q1 <- quantile(sepal_lengths, 0.25)
q2 <- quantile(sepal_lengths, 0.50) # This is the median
q3 <- quantile(sepal_lengths, 0.75)
iqr_value <- IQR(sepal_lengths)
cat("Lower Quartile (Q1):", q1, "\n")## Lower Quartile (Q1): 5.1
## Median (Q2): 5.8
## Upper Quartile (Q3): 6.4
## Interquartile Range (IQR): 1.3
# Boxplot shows quartiles
boxplot(sepal_lengths,
main = "Boxplot Showing Quartiles",
ylab = "Sepal Length (cm)",
col = "lightblue",
horizontal = TRUE)
text(x = q1, y = 1.3, labels = "Q1", col = "red", cex = 1.2)
text(x = q2, y = 1.3, labels = "Q2 (Median)", col = "red", cex = 1.2)
text(x = q3, y = 1.3, labels = "Q3", col = "red", cex = 1.2)

# Calculate variance and standard deviation
var_value <- var(sepal_lengths)
sd_value <- sd(sepal_lengths)
cat("Variance:", round(var_value, 4), "\n")## Variance: 0.6857
## Standard Deviation: 0.8281
##
## Interpretation:
## On average, sepal lengths deviate by 0.83 cm from the mean.
# Visualize with standard deviation bands
mean_val <- mean(sepal_lengths)
hist(sepal_lengths,
main = "Distribution with Standard Deviation",
xlab = "Sepal Length (cm)",
col = "lavender",
border = "white")
abline(v = mean_val, col = "red", lwd = 2)
abline(v = mean_val - sd_value, col = "blue", lwd = 2, lty = 2)
abline(v = mean_val + sd_value, col = "blue", lwd = 2, lty = 2)
legend("topright",
legend = c("Mean", "± 1 SD"),
col = c("red", "blue"),
lty = c(1, 2),
       lwd = 2)

Understanding Variance and Standard Deviation:
\[s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\]
\[s = \sqrt{s^2}\]
# Manual calculation to understand the formula
deviations <- sepal_lengths - mean(sepal_lengths)
squared_deviations <- deviations^2
sum_squared_dev <- sum(squared_deviations)
manual_variance <- sum_squared_dev / (length(sepal_lengths) - 1)
manual_sd <- sqrt(manual_variance)
cat("Manual variance:", round(manual_variance, 4), "\n")## Manual variance: 0.6857
## R's var(): 0.6857
## Manual SD: 0.8281
## R's sd(): 0.8281
# Comprehensive dispersion measures
dispersion_summary <- data.frame(
Measure = c("Range", "IQR", "Variance", "Std Dev", "Coef of Variation"),
Value = c(
diff(range(sepal_lengths)),
IQR(sepal_lengths),
var(sepal_lengths),
sd(sepal_lengths),
sd(sepal_lengths) / mean(sepal_lengths) * 100
),
Unit = c("cm", "cm", "cm²", "cm", "%")
)
dispersion_summary %>%
mutate(Value = round(Value, 3)) %>%
kable(caption = "Measures of Dispersion for Sepal Length") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

| Measure | Value | Unit |
|---|---|---|
| Range | 3.600 | cm |
| IQR | 1.300 | cm |
| Variance | 0.686 | cm² |
| Std Dev | 0.828 | cm |
| Coef of Variation | 14.171 | % |
# Create frequency table
sepal_length_cut <- cut(sepal_lengths, breaks = 8)
freq_table <- table(sepal_length_cut)
cat("Frequency Distribution:\n")## Frequency Distribution:
## sepal_length_cut
## (4.3,4.75] (4.75,5.2] (5.2,5.65] (5.65,6.1] (6.1,6.55] (6.55,7] (7,7.45]
## 11 34 20 30 25 18 6
## (7.45,7.9]
## 6
# Create histogram with frequency
hist(sepal_lengths,
main = "Frequency Distribution of Sepal Length",
xlab = "Sepal Length (cm)",
ylab = "Frequency",
col = "steelblue",
border = "white",
breaks = 12)
# Add density curve
lines(density(sepal_lengths), col = "red", lwd = 3)
legend("topright", legend = "Density Curve", col = "red", lwd = 3)# Create different distribution shapes
par(mfrow = c(2, 2))
# Normal distribution
normal_data <- rnorm(1000, mean = 50, sd = 10)
hist(normal_data, main = "Normal Distribution",
col = "lightgreen", border = "white", xlab = "Values")
# Right-skewed (positive skew)
skewed_right <- rexp(1000, rate = 0.5)
hist(skewed_right, main = "Right Skewed (Positive)",
col = "lightcoral", border = "white", xlab = "Values")
# Left-skewed (negative skew)
skewed_left <- -rexp(1000, rate = 0.5)
hist(skewed_left, main = "Left Skewed (Negative)",
col = "lightblue", border = "white", xlab = "Values")
# Uniform distribution
uniform_data <- runif(1000, min = 0, max = 100)
hist(uniform_data, main = "Uniform Distribution",
col = "lavender", border = "white", xlab = "Values")library(moments) # For skewness and kurtosis
# Calculate shape statistics
skewness_val <- skewness(sepal_lengths)
kurtosis_val <- kurtosis(sepal_lengths)
cat("Skewness:", round(skewness_val, 3), "\n")## Skewness: 0.312
cat("Interpretation:",
ifelse(abs(skewness_val) < 0.5, "Approximately Symmetric",
    ifelse(skewness_val > 0, "Right-skewed (Positive)", "Left-skewed (Negative)")), "\n\n")
## Interpretation: Approximately Symmetric
cat("Kurtosis:", round(kurtosis_val, 3), "\n")
## Kurtosis: 2.426
cat("Interpretation:",
ifelse(kurtosis_val > 3, "Leptokurtic (heavy tails)",
    ifelse(kurtosis_val < 3, "Platykurtic (light tails)", "Mesokurtic (normal)")), "\n")
## Interpretation: Platykurtic (light tails)

cat("=== IRIS DATASET SUMMARY ===\n")
summary(iris)
## === IRIS DATASET SUMMARY ===
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# More detailed summary by species
iris_detailed <- iris %>%
group_by(Species) %>%
summarise(
N = n(),
Mean_Sepal_Length = mean(Sepal.Length),
SD_Sepal_Length = sd(Sepal.Length),
Mean_Sepal_Width = mean(Sepal.Width),
SD_Sepal_Width = sd(Sepal.Width),
Mean_Petal_Length = mean(Petal.Length),
SD_Petal_Length = sd(Petal.Length)
) %>%
mutate(across(where(is.numeric), ~round(., 2)))
iris_detailed %>%
kable(caption = "Detailed Summary by Species") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width = "100%")| Species | N | Mean_Sepal_Length | SD_Sepal_Length | Mean_Sepal_Width | SD_Sepal_Width | Mean_Petal_Length | SD_Petal_Length |
|---|---|---|---|---|---|---|---|
| setosa | 50 | 5.01 | 0.35 | 3.43 | 0.38 | 1.46 | 0.17 |
| versicolor | 50 | 5.94 | 0.52 | 2.77 | 0.31 | 4.26 | 0.47 |
| virginica | 50 | 6.59 | 0.64 | 2.97 | 0.32 | 5.55 | 0.55 |
# Multiple ways to visualize single variable
par(mfrow = c(2, 2))
# Histogram
hist(iris$Sepal.Length,
main = "Histogram",
xlab = "Sepal Length (cm)",
col = "steelblue",
border = "white")
# Density plot
plot(density(iris$Sepal.Length),
main = "Density Plot",
xlab = "Sepal Length (cm)",
lwd = 2,
col = "darkblue")
polygon(density(iris$Sepal.Length), col = rgb(0, 0, 1, 0.3))
# Boxplot
boxplot(iris$Sepal.Length,
main = "Boxplot",
ylab = "Sepal Length (cm)",
col = "lightgreen")
# Q-Q plot (check normality)
qqnorm(iris$Sepal.Length, main = "Q-Q Plot")
qqline(iris$Sepal.Length, col = "red", lwd = 2)

# Relationship between two variables
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Sepal Length vs Sepal Width by Species",
x = "Sepal Length (cm)",
y = "Sepal Width (cm)"
) +
theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Pairs plot for multiple variables
pairs(iris[, 1:4],
main = "Scatterplot Matrix of Iris Dataset",
pch = 21,
bg = c("red", "green", "blue")[unclass(iris$Species)])# Faceted visualization
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_histogram(bins = 20, alpha = 0.7, color = "white") +
facet_wrap(~Species, ncol = 1) +
labs(
title = "Sepal Length Distribution by Species",
x = "Sepal Length (cm)",
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
  )

cat("=== MISSING VALUES CHECK ===\n")
## === MISSING VALUES CHECK ===
missing_summary <- data.frame(
Variable = names(iris),
Missing_Count = colSums(is.na(iris)),
Missing_Percent = round(colSums(is.na(iris)) / nrow(iris) * 100, 2)
)
missing_summary %>%
kable(caption = "Missing Values Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

| Variable | Missing_Count | Missing_Percent |
|---|---|---|
| Sepal.Length | 0 | 0 |
| Sepal.Width | 0 | 0 |
| Petal.Length | 0 | 0 |
| Petal.Width | 0 | 0 |
| Species | 0 | 0 |
# Check for outliers using IQR method
detect_outliers <- function(x) {
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
sum(x < lower_bound | x > upper_bound)
}
cat("\n=== OUTLIER DETECTION ===\n")##
## === OUTLIER DETECTION ===
outlier_summary <- iris %>%
select(where(is.numeric)) %>%
summarise(across(everything(), detect_outliers)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Outlier_Count")
outlier_summary %>%
kable(caption = "Outlier Count by Variable (IQR Method)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

| Variable | Outlier_Count |
|---|---|
| Sepal.Length | 0 |
| Sepal.Width | 4 |
| Petal.Length | 0 |
| Petal.Width | 0 |
# Build a simple linear regression model
# Predict Sepal Width from Sepal Length
model_simple <- lm(Sepal.Width ~ Sepal.Length, data = iris)
# View model summary
summary(model_simple)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1095 -0.2454 -0.0167 0.2763 1.3338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41895 0.25356 13.48 <2e-16 ***
## Sepal.Length -0.06188 0.04297 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
# Extract key statistics
r_squared <- summary(model_simple)$r.squared
adj_r_squared <- summary(model_simple)$adj.r.squared
coef_intercept <- coef(model_simple)[1]
coef_slope <- coef(model_simple)[2]
cat("\n=== MODEL INTERPRETATION ===\n")##
## === MODEL INTERPRETATION ===
cat("Equation: Sepal.Width =", round(coef_intercept, 3), "+",
    round(coef_slope, 3), "× Sepal.Length\n")
## Equation: Sepal.Width = 3.419 + -0.062 × Sepal.Length
cat("R-squared:", round(r_squared, 4), "\n")
cat("Adjusted R-squared:", round(adj_r_squared, 4), "\n")
## R-squared: 0.0138
## Adjusted R-squared: 0.0072
# Plot with regression line
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), size = 3, alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
labs(
title = "Simple Linear Regression: Sepal Width vs Sepal Length",
subtitle = paste("R² =", round(r_squared, 3)),
x = "Sepal Length (cm)",
y = "Sepal Width (cm)"
) +
theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
##
## === DIAGNOSTIC INTERPRETATION ===
## 1. Residuals vs Fitted: Check for linearity and homoscedasticity
## 2. Q-Q Plot: Check if residuals are normally distributed
## 3. Scale-Location: Check homoscedasticity
## 4. Residuals vs Leverage: Identify influential points
# Build multiple regression model
# Predict Petal Length from multiple predictors
model_multiple <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width,
data = iris)
# View summary
summary(model_multiple)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.99333 -0.17656 -0.01004 0.18558 1.06909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.26271 0.29741 -0.883 0.379
## Sepal.Length 0.72914 0.05832 12.502 <2e-16 ***
## Sepal.Width -0.64601 0.06850 -9.431 <2e-16 ***
## Petal.Width 1.44679 0.06761 21.399 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.319 on 146 degrees of freedom
## Multiple R-squared: 0.968, Adjusted R-squared: 0.9674
## F-statistic: 1473 on 3 and 146 DF, p-value: < 2.2e-16
##
cat("\n=== MODEL COMPARISON ===\n")
## === MODEL COMPARISON ===
comparison <- data.frame(
Model = c("Simple Regression", "Multiple Regression"),
Predictors = c(1, 3),
R_squared = c(
summary(model_simple)$r.squared,
summary(model_multiple)$r.squared
),
Adj_R_squared = c(
summary(model_simple)$adj.r.squared,
summary(model_multiple)$adj.r.squared
),
RMSE = c(
sqrt(mean(model_simple$residuals^2)),
sqrt(mean(model_multiple$residuals^2))
)
)
comparison %>%
mutate(across(where(is.numeric), ~round(., 4))) %>%
kable(caption = "Model Performance Comparison") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

| Model | Predictors | R_squared | Adj_R_squared | RMSE |
|---|---|---|---|---|
| Simple Regression | 1 | 0.0138 | 0.0072 | 0.4314 |
| Multiple Regression | 3 | 0.9680 | 0.9674 | 0.3147 |
# Create new data for predictions
new_data <- data.frame(
Sepal.Length = c(5.0, 6.0, 7.0),
Sepal.Width = c(3.0, 3.5, 3.2),
Petal.Width = c(1.5, 2.0, 2.5)
)
# Make predictions
predictions <- predict(model_multiple, newdata = new_data, interval = "prediction")
# Combine with input data
prediction_results <- cbind(new_data, predictions) %>%
mutate(across(where(is.numeric), ~round(., 2)))
prediction_results %>%
kable(caption = "Predictions with 95% Confidence Intervals") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Sepal.Length | Sepal.Width | Petal.Width | fit | lwr | upr |
|---|---|---|---|---|---|
| 5 | 3.0 | 1.5 | 3.62 | 2.97 | 4.26 |
| 6 | 3.5 | 2.0 | 4.74 | 4.10 | 5.39 |
| 7 | 3.2 | 2.5 | 6.39 | 5.75 | 7.03 |
A hypothesis is a prediction from a theory that can be tested through data collection and analysis. It’s a specific, testable statement about the relationship between variables.
Key Characteristics of a Good Hypothesis: it should be specific, testable, and falsifiable.
| Type | Symbol | Description | Example |
|---|---|---|---|
| Alternative Hypothesis (H₁) | H₁ or Hₐ | States that an effect WILL occur; the research prediction | Manual transmission cars have higher MPG than automatic cars |
| Null Hypothesis (H₀) | H₀ | States that NO effect will occur; opposite of alternative hypothesis | Manual and automatic cars have the same MPG |
| Directional Hypothesis | H₁ (one-tailed) | Predicts the DIRECTION of the effect (e.g., ‘greater than’, ‘less than’) | Manual cars have HIGHER MPG than automatic cars |
| Non-directional Hypothesis | H₁ (two-tailed) | Predicts an effect exists but NOT its direction (e.g., ‘different from’) | Manual and automatic cars have DIFFERENT MPG |
Important Concept: We can never prove the alternative hypothesis using statistics. We can only reject, or fail to reject, the null hypothesis, based on how improbable the observed data would be if H₀ were true.
“We cannot talk about the null hypothesis being true or the experimental hypothesis being true, we can only talk in terms of the probability of obtaining a particular set of data if, hypothetically speaking, the null hypothesis was true.”
— Field, Miles, & Field (2012)
Figure: Hypothesis Testing Logic Flow
The p-value is the probability of obtaining your observed data (or more extreme) assuming the null hypothesis is true.
Interpretation:
- Small p-value (typically < 0.05): the data are unlikely under H₀ → reject H₀
- Large p-value (≥ 0.05): the data are plausible under H₀ → fail to reject H₀

Common Significance Levels:
- α = 0.05 (5%): most common in social sciences
- α = 0.01 (1%): more stringent
- α = 0.10 (10%): more lenient (exploratory research)
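The logic of α as a false-positive rate can be checked by simulation: when H₀ is true, p-values are uniformly distributed, so about 5% of tests reject at α = 0.05. A minimal sketch (the sample size of 30 and 10,000 replications are arbitrary choices):

```r
# Simulate many one-sample t-tests in a world where H0 is actually true
set.seed(123)
n_sims <- 10000
p_values <- replicate(n_sims, {
  x <- rnorm(30, mean = 6, sd = 1)  # data generated with true mean = 6
  t.test(x, mu = 6)$p.value         # test H0: mu = 6 (which is true here)
})
# The proportion of (false) rejections should be close to alpha = 0.05
cat("False positive rate:", mean(p_values < 0.05), "\n")
```

Raising or lowering the threshold in `p_values < 0.05` shows directly how choosing α trades off Type I error against stringency.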
Research Question: Is the average sepal length of iris flowers different from 6 cm?
Hypotheses:
- H₀: μ = 6 (population mean equals 6)
- H₁: μ ≠ 6 (population mean differs from 6) [two-tailed]
# One-sample t-test
# Test if mean sepal length differs from 6 cm
test_value <- 6
# Perform the test
result_onesample <- t.test(iris$Sepal.Length, mu = test_value)
cat("=== ONE-SAMPLE T-TEST ===\n\n")## === ONE-SAMPLE T-TEST ===
## Research Question: Is mean sepal length different from 6 cm?
## H₀: μ = 6
## H₁: μ ≠ 6
print(result_onesample)
##
## One Sample t-test
##
## data: iris$Sepal.Length
## t = -2.3172, df = 149, p-value = 0.02186
## alternative hypothesis: true mean is not equal to 6
## 95 percent confidence interval:
## 5.709732 5.976934
## sample estimates:
## mean of x
## 5.843333
##
## === INTERPRETATION ===
## Sample mean: 5.843 cm
## t-statistic: -2.317
## p-value: 0.02186
cat("95% CI: [", round(result_onesample$conf.int[1], 3), ",",
    round(result_onesample$conf.int[2], 3), "]\n\n")
## 95% CI: [ 5.71 , 5.977 ]
if(result_onesample$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: The mean sepal length is significantly different from", test_value, "cm.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: Insufficient evidence that mean differs from", test_value, "cm.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: The mean sepal length is significantly different from 6 cm.
# Visualize
hist(iris$Sepal.Length,
main = "Sepal Length Distribution with Test Value",
xlab = "Sepal Length (cm)",
col = "lightblue",
border = "white",
breaks = 15)
abline(v = mean(iris$Sepal.Length), col = "red", lwd = 3, lty = 1)
abline(v = test_value, col = "blue", lwd = 3, lty = 2)
legend("topright",
legend = c(paste("Sample Mean =", round(mean(iris$Sepal.Length), 2)),
paste("Test Value =", test_value)),
col = c("red", "blue"),
lty = c(1, 2),
       lwd = 3)

Research Question: Do manual transmission cars have different fuel efficiency than automatic transmission cars?

Hypotheses:
- H₀: μ_manual = μ_automatic (no difference in mean MPG)
- H₁: μ_manual ≠ μ_automatic (the means differ) [two-tailed]
# Independent samples t-test
# Compare MPG between automatic and manual transmission
# Prepare data
automatic <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
cat("=== INDEPENDENT TWO-SAMPLE T-TEST ===\n\n")## === INDEPENDENT TWO-SAMPLE T-TEST ===
## Research Question: Do manual and automatic cars differ in fuel efficiency?
## H₀: μ_manual = μ_automatic
## H₁: μ_manual ≠ μ_automatic
# Perform t-test
result_independent <- t.test(manual, automatic, var.equal = FALSE)
print(result_independent)
##
## Welch Two Sample t-test
##
## data: manual and automatic
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.209684 11.280194
## sample estimates:
## mean of x mean of y
## 24.39231 17.14737
##
## === DESCRIPTIVE STATISTICS ===
## Automatic transmission:
## n = 19
## Mean = 17.15 mpg
## SD = 3.83
## Manual transmission:
## n = 13
## Mean = 24.39 mpg
## SD = 6.17
## === INTERPRETATION ===
## Difference in means: 7.24 mpg
## t-statistic: 3.767
## p-value: 0.001374
if(result_independent$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: Manual and automatic transmissions have significantly different MPG.\n")
cat("Manual transmission cars have",
ifelse(mean(manual) > mean(automatic), "HIGHER", "LOWER"),
"fuel efficiency.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant difference in MPG between transmission types.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: Manual and automatic transmissions have significantly different MPG.
## Manual transmission cars have HIGHER fuel efficiency.
# Visualize
boxplot(mpg ~ am, data = mtcars,
names = c("Automatic", "Manual"),
main = "Fuel Efficiency by Transmission Type",
ylab = "Miles per Gallon (MPG)",
col = c("lightcoral", "lightgreen"),
border = c("darkred", "darkgreen"))
points(c(1, 2), c(mean(automatic), mean(manual)),
pch = 19, col = "blue", cex = 2)
legend("bottomright", legend = "Mean", pch = 19, col = "blue", cex = 1.2)Research Question: Does a new teaching method improve test scores?
Hypotheses: - H₀: μ_difference = 0 (no change in scores) - H₁: μ_difference > 0 (scores increased) [one-tailed]
# Paired t-test
# Create example data: before and after scores
set.seed(seeds[3])
n_students <- 25
before_scores <- rnorm(n_students, mean = 70, sd = 10)
# Add improvement effect
improvement <- rnorm(n_students, mean = 5, sd = 8)
after_scores <- before_scores + improvement
student_data <- data.frame(
Student = 1:n_students,
Before = round(before_scores, 1),
After = round(after_scores, 1),
Difference = round(after_scores - before_scores, 1)
)
cat("=== PAIRED T-TEST (BEFORE-AFTER DESIGN) ===\n\n")## === PAIRED T-TEST (BEFORE-AFTER DESIGN) ===
## Research Question: Did the new teaching method improve test scores?
## H₀: μ_difference = 0 (no improvement)
## H₁: μ_difference > 0 (scores improved) [one-tailed]
## Sample Data (first 6 students):
| Student | Before | After | Difference |
|---|---|---|---|
| 1 | 85.0 | 84.3 | -0.7 |
| 2 | 54.8 | 55.0 | 0.2 |
| 3 | 67.6 | 86.9 | 19.3 |
| 4 | 59.0 | 64.8 | 5.8 |
| 5 | 77.2 | 90.9 | 13.7 |
| 6 | 72.7 | 81.5 | 8.9 |
# Perform paired t-test (one-tailed)
result_paired <- t.test(student_data$After, student_data$Before,
paired = TRUE,
alternative = "greater")
cat("\n")##
## Paired t-test
##
## data: student_data$After and student_data$Before
## t = 3.1343, df = 24, p-value = 0.002251
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
## 2.44324 Inf
## sample estimates:
## mean difference
## 5.38
##
## === DESCRIPTIVE STATISTICS ===
cat("Before: Mean =", round(mean(student_data$Before), 2),
", SD =", round(sd(student_data$Before), 2), "\n")## Before: Mean = 70.15 , SD = 11.15
cat("After: Mean =", round(mean(student_data$After), 2),
", SD =", round(sd(student_data$After), 2), "\n")## After: Mean = 75.53 , SD = 14.87
## Mean Difference: 5.38 points
## === INTERPRETATION ===
## t-statistic: 3.134
## p-value (one-tailed): 0.002251
if(result_paired$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: The teaching method significantly improved test scores.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant evidence of improvement.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: The teaching method significantly improved test scores.
# Visualize
par(mfrow = c(1, 2))
# Box plot comparison
boxplot(student_data$Before, student_data$After,
names = c("Before", "After"),
main = "Score Comparison",
ylab = "Test Score",
col = c("lightcoral", "lightgreen"))
# Difference histogram
hist(student_data$Difference,
main = "Score Changes",
xlab = "Difference (After - Before)",
col = "lightblue",
border = "white")
abline(v = 0, col = "red", lwd = 3, lty = 2)
abline(v = mean(student_data$Difference), col = "blue", lwd = 3)
legend("topright",
legend = c("No Change", "Mean Change"),
col = c("red", "blue"),
lty = c(2, 1),
       lwd = 3)

Research Question: Is there a relationship between transmission type and number of cylinders?
## === CHI-SQUARE TEST OF INDEPENDENCE ===
## Research Question: Are transmission type and cylinder count independent?
## H₀: Transmission and cylinders are independent (no relationship)
## H₁: Transmission and cylinders are related
# Create contingency table
contingency_table <- table(mtcars$am, mtcars$cyl)
rownames(contingency_table) <- c("Automatic", "Manual")
colnames(contingency_table) <- paste(c(4, 6, 8), "cyl")
cat("Contingency Table:\n")## Contingency Table:
##
## 4 cyl 6 cyl 8 cyl
## Automatic 3 4 12
## Manual 8 3 2
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 8.7407, df = 2, p-value = 0.01265
##
## === INTERPRETATION ===
## Chi-square statistic: 8.741
## p-value: 0.01265
if(result_chisq$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: There is a significant relationship between transmission and cylinders.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant relationship detected.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: There is a significant relationship between transmission and cylinders.
# Visualize
barplot(contingency_table,
beside = TRUE,
legend = TRUE,
main = "Transmission Type by Cylinder Count",
xlab = "Number of Cylinders",
ylab = "Frequency",
col = c("lightcoral", "lightgreen"))Research Question: Is there a linear relationship between car weight and fuel efficiency?
## === CORRELATION TEST ===
## Research Question: Is there a correlation between weight and MPG?
## H₀: ρ = 0 (no correlation)
## H₁: ρ ≠ 0 (correlation exists)
##
## Pearson's product-moment correlation
##
## data: mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
##
## === INTERPRETATION ===
## Correlation coefficient (r): -0.868
## t-statistic: -9.559
## p-value: 1.294e-10
if(result_cor$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: There is a significant",
ifelse(result_cor$estimate < 0, "NEGATIVE", "POSITIVE"),
"correlation.\n")
cat("Interpretation:", abs(round(result_cor$estimate, 3)),
"indicates a",
ifelse(abs(result_cor$estimate) > 0.7, "strong",
ifelse(abs(result_cor$estimate) > 0.4, "moderate", "weak")),
"relationship.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant correlation detected.\n")
}## Decision: REJECT H₀ (p < 0.05)
## Conclusion: There is a significant NEGATIVE correlation.
## Interpretation: 0.868 indicates a strong relationship.
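Squaring r gives the coefficient of determination, the share of MPG variance explained by weight; a quick check:

```r
# Coefficient of determination from the correlation
r <- cor(mtcars$wt, mtcars$mpg)
cat("r =", round(r, 3), " r^2 =", round(r^2, 3), "\n")
# r^2 is about 0.75: weight alone accounts for roughly 75% of the variance in MPG
```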
# Visualize
plot(mtcars$wt, mtcars$mpg,
main = paste("Weight vs MPG (r =", round(result_cor$estimate, 3), ")"),
xlab = "Weight (1000 lbs)",
ylab = "Miles per Gallon",
pch = 19,
col = "steelblue")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)

Effect size measures the magnitude of a difference or relationship, independent of sample size.
Common Effect Size Measures:

- Cohen’s d: for t-tests (small: 0.2, medium: 0.5, large: 0.8)
- r: for correlations (small: 0.1, medium: 0.3, large: 0.5)
- η² (eta-squared): for ANOVA
# Calculate Cohen's d for transmission comparison
# Install effsize if needed
if (!require("effsize", quietly = TRUE)) {
install.packages("effsize", repos = "https://cran.r-project.org")
library(effsize)
}
cat("=== EFFECT SIZE ANALYSIS ===\n\n")## === EFFECT SIZE ANALYSIS ===
# Cohen's d for transmission effect on MPG
cohens_d <- cohen.d(mtcars$mpg ~ factor(mtcars$am))
print(cohens_d)##
## Cohen's d
##
## d estimate: -1.477947 (large)
## 95 percent confidence interval:
## lower upper
## -2.304209 -0.651685
##
## === INTERPRETATION ===
## Cohen's d: -1.478
## Magnitude: large
## Effect Size Guidelines (Cohen, 1988):
## Small: d = 0.2
## Medium: d = 0.5
## Large: d = 0.8
## Practical Significance:
if(abs(cohens_d$estimate) < 0.2) {
cat("The effect is negligible - may not be practically important.\n")
} else if(abs(cohens_d$estimate) < 0.5) {
cat("The effect is small - some practical importance.\n")
} else if(abs(cohens_d$estimate) < 0.8) {
cat("The effect is medium - likely practically important.\n")
} else {
cat("The effect is large - very practically important.\n")
}## The effect is large - very practically important.
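The effsize result can be verified by hand: Cohen's d is the difference in group means divided by the pooled standard deviation. A sketch:

```r
# Cohen's d by hand: mean difference over pooled SD
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
n1 <- length(auto); n2 <- length(manual)
pooled_sd <- sqrt(((n1 - 1) * var(auto) + (n2 - 1) * var(manual)) / (n1 + n2 - 2))
d <- (mean(auto) - mean(manual)) / pooled_sd
cat("Cohen's d (by hand):", round(d, 3), "\n")  # matches cohen.d(): -1.478
```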
| Test | Purpose | Data Type | R Function |
|---|---|---|---|
| One-Sample t-test | Compare sample mean to known value | Continuous | t.test(x, mu=value) |
| Independent t-test | Compare means of two independent groups | Continuous | t.test(x, y) |
| Paired t-test | Compare means of paired/related observations | Continuous | t.test(x, y, paired=TRUE) |
| Chi-Square Test | Test relationship between categorical variables | Categorical | chisq.test(table) |
| Correlation Test | Test linear relationship between continuous variables | Continuous | cor.test(x, y) |
| ANOVA | Compare means of 3+ groups | Continuous | aov() or anova() |
## === AIR QUALITY DATASET ===
## Dimensions: 153 6
## Variables: Ozone Solar.R Wind Temp Month Day
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
##
## Missing values by variable:
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
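With 37 missing Ozone values, every summary of this dataset needs explicit NA handling; a minimal sketch:

```r
# mean() propagates NAs unless told to drop them
mean(airquality$Ozone)               # NA
mean(airquality$Ozone, na.rm = TRUE) # about 42.13

# Listwise deletion: keep only fully observed rows
complete <- na.omit(airquality)
nrow(complete)                       # 111 of the original 153 rows remain
```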
# Visualize relationships
ggplot(airquality, aes(x = Temp, y = Ozone)) +
geom_point(aes(color = Month), alpha = 0.6, size = 3) +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Ozone Levels vs Temperature",
subtitle = "New York Air Quality Measurements (1973)",
x = "Temperature (°F)",
y = "Ozone (ppb)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

## === MOTOR TRENDS ANALYSIS ===
# Descriptive statistics by transmission type
mtcars_summary <- mtcars %>%
mutate(Transmission = factor(am, labels = c("Automatic", "Manual"))) %>%
group_by(Transmission) %>%
summarise(
N = n(),
Mean_MPG = mean(mpg),
SD_MPG = sd(mpg),
Mean_HP = mean(hp),
SD_HP = sd(hp),
Mean_Weight = mean(wt)
) %>%
mutate(across(where(is.numeric), ~round(., 2)))
mtcars_summary %>%
kable(caption = "Car Performance by Transmission Type") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| Transmission | N | Mean_MPG | SD_MPG | Mean_HP | SD_HP | Mean_Weight |
|---|---|---|---|---|---|---|
| Automatic | 19 | 17.15 | 3.83 | 160.26 | 53.91 | 3.77 |
| Manual | 13 | 24.39 | 6.17 | 126.85 | 84.06 | 2.41 |
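The 7.2 MPG gap between manual and automatic cars in the table can be tested formally; a sketch using a Welch two-sample t-test:

```r
# Welch two-sample t-test: does mean MPG differ by transmission?
tt <- t.test(mpg ~ am, data = mtcars)
cat("t =", round(unname(tt$statistic), 3),
    " p =", signif(tt$p.value, 3), "\n")
# p < 0.05: the difference in mean MPG is statistically significant
```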
# Visualize MPG distribution
ggplot(mtcars, aes(x = factor(am, labels = c("Automatic", "Manual")), y = mpg)) +
geom_boxplot(aes(fill = factor(am)), alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.5) +
labs(
title = "Fuel Efficiency by Transmission Type",
x = "Transmission",
y = "Miles per Gallon (MPG)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
) +
scale_fill_brewer(palette = "Set2")

# Demonstrate population vs sample concepts
# Use seedhash for reproducible sampling
set.seed(seeds[2]) # Use second seed from our generator
# Population: All iris flowers
population_mean <- mean(iris$Sepal.Length)
population_sd <- sd(iris$Sepal.Length)
cat("=== POPULATION STATISTICS ===\n")## === POPULATION STATISTICS ===
## Population Size: 150
## Population Mean: 5.843
## Population SD: 0.828
# Take multiple samples
sample_sizes <- c(10, 30, 50)
sample_results <- list()
for(n in sample_sizes) {
sample_means <- replicate(1000, mean(sample(iris$Sepal.Length, n)))
sample_results[[as.character(n)]] <- data.frame(
Sample_Size = n,
Mean_of_Means = mean(sample_means),
SD_of_Means = sd(sample_means),
Min = min(sample_means),
Max = max(sample_means)
)
}
sample_comparison <- bind_rows(sample_results) %>%
mutate(across(where(is.numeric) & !Sample_Size, ~round(., 4)))
cat("=== SAMPLING DISTRIBUTION ===\n")## === SAMPLING DISTRIBUTION ===
## Based on 1000 samples for each size
sample_comparison %>%
kable(caption = "Effect of Sample Size on Sampling Distribution") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

| Sample_Size | Mean_of_Means | SD_of_Means | Min | Max |
|---|---|---|---|---|
| 10 | 5.8449 | 0.2560 | 4.9900 | 6.710 |
| 30 | 5.8446 | 0.1337 | 5.4267 | 6.320 |
| 50 | 5.8371 | 0.0930 | 5.5200 | 6.112 |
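The shrinking SD_of_Means column tracks the theoretical standard error σ/√n, reduced slightly by the finite population correction because sample() draws without replacement from only 150 flowers. A quick check:

```r
# Standard error of the mean, with the finite population correction
# (sampling without replacement from N = 150 flowers)
sigma <- sd(iris$Sepal.Length)
N <- nrow(iris)
for (n in c(10, 30, 50)) {
  se  <- sigma / sqrt(n)            # infinite-population SE
  fpc <- sqrt((N - n) / (N - 1))    # finite population correction
  cat("n =", n, "  SE =", round(se, 4),
      "  corrected SE =", round(se * fpc, 4), "\n")
}
```

The corrected values line up closely with the simulated SD_of_Means column above.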
# Visualize sampling distributions
par(mfrow = c(1, 3))
for(n in sample_sizes) {
sample_means <- replicate(1000, mean(sample(iris$Sepal.Length, n)))
hist(sample_means,
main = paste("Sample Size =", n),
xlab = "Sample Mean",
col = "lightblue",
border = "white",
xlim = c(4.9, 6.8)) # Wide enough to show all sample means
abline(v = population_mean, col = "red", lwd = 3, lty = 2)
}

Central Tendency:

- Mean: average (sensitive to outliers)
- Median: middle value (robust to outliers)
- Mode: most frequent value

Dispersion:

- Range: Max - Min (sensitive to extremes)
- IQR: Q3 - Q1 (robust measure)
- Standard Deviation: average deviation from the mean
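All of these measures are one-liners in base R except the mode, which has no built-in function for data and is taken from a frequency table; a sketch using the mtcars MPG column:

```r
x <- mtcars$mpg

# Central tendency
mean(x)    # 20.09
median(x)  # 19.2
# Mode via frequency table (ties resolved by first occurrence)
as.numeric(names(which.max(table(x))))

# Dispersion
diff(range(x)) # 23.5
IQR(x)         # 7.375
sd(x)          # 6.027
```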
# ===== DATA EXPLORATION =====
# Load data
df <- mtcars # Example dataset (replace with your own data)
head(df) # First 6 rows
tail(df) # Last 6 rows
str(df) # Structure
summary(df) # Summary statistics
dim(df) # Dimensions
names(df) # Variable names
# ===== DESCRIPTIVE STATISTICS =====
mean(x) # Mean
median(x) # Median
sd(x) # Standard deviation
var(x) # Variance
range(x) # Range
IQR(x) # Interquartile range
quantile(x) # Quartiles
table(x) # Frequency table
# ===== VISUALIZATION =====
hist(x) # Histogram
boxplot(x) # Boxplot
plot(x, y) # Scatterplot
barplot(table(x)) # Bar plot
pairs(df) # Scatterplot matrix
# ===== TIDYVERSE =====
df %>%
filter(condition) %>% # Filter rows
select(var1, var2) %>% # Select columns
mutate(new_var = ...) %>% # Create new variables
group_by(category) %>% # Group data
summarise(mean = mean(x)) # Summarize
# ===== MODELING =====
model <- lm(y ~ x, data = df) # Linear regression
summary(model) # Model summary
predict(model, newdata = new_df) # Predictions
plot(model) # Diagnostic plots
# ===== TESTING =====
t.test(x, y) # T-test
cor.test(x, y) # Correlation test
chisq.test(x, y) # Chi-square test
shapiro.test(x) # Normality test

# ❌ BAD: Not checking for missing values
mean(data$variable)
# ✅ GOOD: Handle missing values
mean(data$variable, na.rm = TRUE)
# ❌ BAD: Assuming normality without checking
t.test(group1, group2)
# ✅ GOOD: Check normality first
shapiro.test(group1)
hist(group1)
# Then decide on appropriate test
# ❌ BAD: Not setting seed for reproducibility
sample(data, 10)
# ✅ GOOD: Set seed for reproducible results
set.seed(123)
sample(data, 10)
# ✅ BEST: Use seedhash for reproducible, named seeds
library(seedhash)
gen <- SeedHashGenerator$new("YourName")
seeds <- gen$generate_seeds(5)
set.seed(seeds[1])
sample(data, 10)
# ❌ BAD: Ignoring assumptions
model <- lm(y ~ x)
# Use model without checking
# ✅ GOOD: Check assumptions
model <- lm(y ~ x)
plot(model) # Check diagnostic plots

# Data Manipulation
install.packages("dplyr") # Data wrangling
install.packages("tidyr") # Data tidying
install.packages("data.table") # Fast data manipulation
# Visualization
install.packages("ggplot2") # Advanced plotting
install.packages("plotly") # Interactive plots
install.packages("ggpubr") # Publication-ready plots
# Statistics
install.packages("moments") # Skewness & kurtosis
install.packages("car") # Regression diagnostics
install.packages("psych") # Psychological statistics
# Reproducibility
install.packages("seedhash") # Reproducible seed generation
# Tables
install.packages("knitr") # Basic tables
install.packages("kableExtra") # Enhanced tables
install.packages("gt") # Grammar of tables

# Getting help on any function
?function_name # or
help(function_name)

Using the mtcars dataset:
- mpg
- cyl (number of cylinders)
- hp (horsepower)
- wt (weight)

Tasks:

- Create a histogram of mpg with appropriate labels
- Compare mpg across different cyl values
- Plot wt vs mpg with a regression line

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] effsize_0.8.1 plotly_4.11.0 moments_0.14.1 kableExtra_1.4.0
## [5] knitr_1.50 lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2
## [9] dplyr_1.1.4 purrr_1.1.0 readr_2.1.5 tidyr_1.3.1
## [13] tibble_3.3.0 ggplot2_4.0.0 tidyverse_2.0.0 seedhash_0.1.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 xml2_1.4.1 lattice_0.22-7
## [5] stringi_1.8.7 hms_1.1.4 digest_0.6.37 magrittr_2.0.4
## [9] evaluate_1.0.5 grid_4.5.2 timechange_0.3.0 RColorBrewer_1.1-3
## [13] fastmap_1.2.0 Matrix_1.7-4 jsonlite_2.0.0 httr_1.4.7
## [17] mgcv_1.9-3 crosstalk_1.2.2 viridisLite_0.4.2 scales_1.4.0
## [21] lazyeval_0.2.2 textshaping_1.0.4 jquerylib_0.1.4 cli_3.6.5
## [25] rlang_1.1.6 splines_4.5.2 withr_3.0.2 cachem_1.1.0
## [29] yaml_2.3.10 tools_4.5.2 tzdb_0.5.0 vctrs_0.6.5
## [33] R6_2.6.1 lifecycle_1.0.4 htmlwidgets_1.6.4 pkgconfig_2.0.3
## [37] pillar_1.11.1 bslib_0.9.0 gtable_0.3.6 data.table_1.17.8
## [41] glue_1.8.0 systemfonts_1.3.1 xfun_0.54 tidyselect_1.2.1
## [45] rstudioapi_0.17.1 farver_2.1.2 nlme_3.1-168 htmltools_0.5.8.1
## [49] labeling_0.4.3 rmarkdown_2.30 svglite_2.2.2 compiler_4.5.2
## [53] S7_0.2.0
End of Week 02: R for Data Analytics Tutorial
ANLY 500 - Analytics I
Harrisburg University